187 research outputs found

    Entity matching with transformer architectures - a step forward in data integration

    Get PDF
    Transformer architectures have proven to be very effective and provide state-of-the-art results in many natural language tasks. The attention-based architecture in combination with pre-training on large amounts of text lead to the recent breakthrough and a variety of slightly different implementations. In this paper we analyze how well four of the most recent attention-based transformer architectures (BERT, XLNet, RoBERTa and DistilBERT) perform on the task of entity matching - a crucial part of data integration. Entity matching (EM) is the task of finding data instances that refer to the same real-world entity. It is a challenging task if the data instances consist of long textual data or if the data instances are "dirty" due to misplaced values. To evaluate the capability of transformer architectures and transfer-learning on the task of EM, we empirically compare the four approaches on inherently difficult data sets. We show that transformer architectures outperform classical deep learning methods in EM by an average margin of 27.5%

    Data Grid tutorials with hands-on experience

    Get PDF
    Grid technologies are more and more used in scientific as well as in industrial environments but often documentation and the correct usage are either not sufficient or not too well understood. Comprehensive training with hands-on experience helps people first to understand the technology and second to use it in a correct and efficient way. We have organised and run several training sessions in different locations all over the world and provide our experience. The major factors of success are a solid base of theoretical lectures and, more dominantly, a facility that allows for practical Grid exercises during and possibly after tutorial sessions

    Data Science für Lehre, Forschung und Praxis

    Get PDF
    Erworben im Rahmen der Schweizer Nationallizenzen (http://www.nationallizenzen.ch)Data Science ist in aller Munde. Nicht nur wird an Konferenzen zu Big Data, Cloud Computing oder Data Warehousing darüber gesprochen: Glaubt man dem McKinsey Global Institute, so wird es alleine in den USA in den nächsten Jahren eine Lücke von bis zu 190.000 Data Scientists geben. In diesem Kapitel beleuchten wir daher zunächst die Hintergründe des Begriffs Data Science. Dann präsentieren wir typische Anwendungsfälle und Lösungsstrategien auch aus dem Big Data Umfeld. Schließlich zeigen wir am Beispiel des Diploma of Advanced Studies in Data Science der ZHAW Möglichkeiten auf, selber aktiv zu werden

    SODA: Generating SQL for Business Users

    Full text link
    The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work great in order to generate pre-canned reports. However, with their current complexity, they tend to be a poor match for non tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that uses the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly-complex data warehouses.Comment: VLDB201

    Lessons learned from challenging data science case studies

    Get PDF
    In this chapter, we revisit the conclusions and lessons learned of the chapters presented in Part II of this book and analyze them systematically. The goal of the chapter is threefold: firstly, it serves as a directory to the individual chapters, allowing readers to identify which chapters to focus on when they are interested either in a certain stage of the knowledge discovery process or in a certain data science method or application area. Secondly, the chapter serves as a digested, systematic summary of data science lessons that are relevant for data science practitioners. And lastly, we reflect on the perceptions of a broader public towards the methods and tools that we covered in this book and dare to give an outlook towards the future developments that will be influenced by them

    Toward automatic data curation for open data

    Get PDF
    In recent years large amounts of data have been made publicly available: literally thousands of open data sources exist, with genome data, temperature measurements, stock market prices, population and income statistics etc. However, accessing and combining data from different data sources is both non-trivial and very time consuming. These tasks typically take up to 80% of the time of data scientists. Automatic integration and curation of open data can facilitate this process

    Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective

    Full text link
    The current boom of learned query optimizers (LQO) can be explained not only by the general continuous improvement of deep learning (DL) methods but also by the straightforward formulation of a query optimization problem (QOP) as a machine learning (ML) one. The idea is often to replace dynamic programming approaches, widespread for solving QOP, with more powerful methods such as reinforcement learning. However, such a rapid "game change" in the field of QOP could not pass without consequences - other parts of the ML pipeline, except for predictive model development, have large improvement potential. For instance, different LQOs introduce their own restrictions on training data generation from queries, use an arbitrary train/validation approach, and evaluate on a voluntary split of benchmark queries. In this paper, we attempt to standardize the ML pipeline for evaluating LQOs by introducing a new end-to-end benchmarking framework. Additionally, we guide the reader through each data science stage in the ML pipeline and provide novel insights from the machine learning perspective, considering the specifics of QOP. Finally, we perform a rigorous evaluation of existing LQOs, showing that PostgreSQL outperforms these LQOs in almost all experiments depending on the train/test splits

    Database search vs. information retrieval : a novel method for studying natural language querying of semi-structured data

    Get PDF
    The traditional approach of querying a relational database is via a formal language, namely SQL. Recent developments in the design of natural language interfaces to databases show promising results for querying either with keywords or with full natural language queries and thus render relational databases more accessible to non-tech savvy users. Such enhanced relational databases basically use a search paradigm which is commonly used in the field of information retrieval. However, the way systems are evaluated in the database and the information retrieval communities often differs due to a lack of common benchmarks. In this paper, we provide an adapted benchmark data set that is based on a test collection originally used to evaluate information retrieval systems. The data set contains 45 information needs developed on the Internet Movie Database (IMDb), including corresponding relevance assessments. By mapping this benchmark data set to a relational database schema, we enable a novel way of directly comparing database search techniques with information retrieval. To demonstrate the feasibility of our approach, we present an experimental evaluation that compares SODA, a keyword-enabled relational database system, against the Terrier information retrieval system and thus lays the foundation for a future discussion of evaluating database systems that support natural language interfaces
    corecore